Adding a Fitted Line to a Scatterplot in Python (4 Examples)

In this tutorial we show several ways to add a fitted line to a scatterplot using popular visualization packages in Python. Before a fitted line can be drawn, it has to be estimated with some regression method. You can either compute the line yourself or use built-in tools for this purpose, and we look at both approaches. With Matplotlib and Bokeh we visualize fitted lines obtained from our own functions. With Plotly and Seaborn we use the built-in approximators. We also demonstrate how to add interactivity to plots with Bokeh and Plotly. Finally, since we create similar plots with different tools, it is possible to compare those tools in terms of ease of use and flexibility. In short, the content is organised as follows:

IMPORT PACKAGES

GENERATE DATA

CALCULATE FITTED LINE

1) ADD FITTED LINE TO SCATTERPLOT USING MATPLOTLIB

2) ADD FITTED LINE TO SCATTERPLOT USING BOKEH

3) ADD FITTED LINE TO SCATTERPLOT USING PLOTLY

4) ADD FITTED LINE TO SCATTERPLOT USING SEABORN

IMPORT PACKAGES

As mentioned above, we are going to use four popular visualization tools. Below you can find all the functions imported from these packages that are needed to complete the task. Some of the functions imported here or defined later are helper functions, used to organize the code and the visualized output in a better way.

GENERATE DATA

We are going to generate a linear, a quadratic and a polynomial function of order 5 on a grid of 20 values in the range from -1 to 1.
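
The data generation described above could look roughly like the sketch below. The coefficients, the noise level, the fixed seed and the `add_noise` helper are assumptions made for illustration; only the grid of 20 values in [-1, 1] and the three function types come from the text.

```python
import numpy as np

# Grid of 20 evenly spaced values in the range [-1, 1].
x = np.linspace(-1, 1, 20)

# Fixed seed so the noisy samples are reproducible (an assumption of this sketch).
rng = np.random.default_rng(0)


def add_noise(y, scale=0.1):
    """Hypothetical helper: add Gaussian noise so the points scatter around the curve."""
    return y + rng.normal(0.0, scale, size=y.shape)


# Illustrative coefficients, not taken from the original tutorial.
y_linear = add_noise(2.0 * x + 1.0)              # linear function
y_quadratic = add_noise(3.0 * x**2 - x + 0.5)    # quadratic function
y_poly = add_noise(x**5 - 2.0 * x**3 + x)        # polynomial of order 5
```

Adding noise is a design choice of this sketch: without it every point would lie exactly on the fitted line and the scatterplot would be less informative.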

CALCULATE FITTED LINE

Below we define two functions that approximate the fitted line based on our data. The first function estimates the coefficients of the original equations using the least squares method, implemented from scratch. The second one uses a built-in function from the numpy package: np.polyfit() is also a least squares method and returns the vector of coefficients that minimises the squared error.
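
A minimal sketch of the two approaches is shown here. The function names `fit_from_scratch` and `fit_with_numpy` are hypothetical, and solving the normal equations is one possible way to implement least squares from scratch, not necessarily the one used in the original code.

```python
import numpy as np


def fit_from_scratch(x, y, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    # Design matrix with columns x**degree, ..., x**1, x**0,
    # i.e. the same coefficient order that np.polyfit returns.
    X = np.vander(x, degree + 1)
    # Solve (X^T X) beta = X^T y for beta.
    return np.linalg.solve(X.T @ X, X.T @ y)


def fit_with_numpy(x, y, degree):
    """Least-squares polynomial coefficients using the built-in np.polyfit."""
    return np.polyfit(x, y, degree)


# Noise-free quadratic data (illustrative coefficients), so both fits
# should recover the coefficients 3, -1 and 0.5 up to rounding error.
x = np.linspace(-1, 1, 20)
y = 3.0 * x**2 - x + 0.5

coef_scratch = fit_from_scratch(x, y, degree=2)
coef_numpy = fit_with_numpy(x, y, degree=2)

# Evaluate the fitted polynomial on the grid to get the line to plot.
fitted = np.polyval(coef_scratch, x)
```

The normal equations are easy to read but can become ill-conditioned for high polynomial degrees; on this small grid and for degrees up to 5 that is not a practical concern.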

ADD FITTED LINE TO SCATTERPLOT USING MATPLOTLIB

The first visualization tool we are going to use is matplotlib. Below we define a grid of 6 subplots visualizing the linear, quadratic and polynomial fitted lines, computed both from scratch and with the help of the numpy package. After initializing the figure and axes and setting the spacing between plots, we define the __visualize_using_matplotlib__ function, where we describe all the necessary steps. First we add the scatterplot and the fitted line using the __scatter__ and __plot__ functions from matplotlib. Then we add some description to make the plots comprehensible. In this example we set the __x__ and __y__ labels and add a legend that names the approximation method used for the fitted line.
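
The steps above can be sketched as follows. The signature of `visualize_using_matplotlib`, the data coefficients, the figure size and the spacing values are assumptions of this sketch; the from-scratch fit is done here with the normal equations.

```python
import matplotlib.pyplot as plt
import numpy as np


def visualize_using_matplotlib(ax, x, y, y_fitted, method):
    """Draw the data points and the fitted line on one subplot."""
    ax.scatter(x, y, label="data")
    ax.plot(x, y_fitted, color="red", label=method)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()  # the legend names the approximation method


# Illustrative noisy data, keyed by polynomial degree (assumed coefficients).
x = np.linspace(-1, 1, 20)
rng = np.random.default_rng(0)
datasets = {
    1: 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape),
    2: 3.0 * x**2 - x + 0.5 + rng.normal(0.0, 0.1, size=x.shape),
    5: x**5 - 2.0 * x**3 + x + rng.normal(0.0, 0.1, size=x.shape),
}

# 2 rows (fitting method) x 3 columns (function type) = 6 subplots.
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.subplots_adjust(hspace=0.3, wspace=0.3)

for col, (degree, y) in enumerate(datasets.items()):
    # Top row: least squares from scratch via the normal equations.
    X = np.vander(x, degree + 1)
    coef_scratch = np.linalg.solve(X.T @ X, X.T @ y)
    visualize_using_matplotlib(
        axes[0, col], x, y, np.polyval(coef_scratch, x), "from scratch"
    )
    # Bottom row: the same fit computed with np.polyfit.
    coef_numpy = np.polyfit(x, y, degree)
    visualize_using_matplotlib(
        axes[1, col], x, y, np.polyval(coef_numpy, x), "np.polyfit"
    )
```

Calling `plt.show()` or `fig.savefig(...)` afterwards displays or stores the figure.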

ADD FITTED LINE TO SCATTERPLOT USING BOKEH

__Bokeh__ is a widely used visualization tool that creates interactive plots and allows exporting them to __html__ format without losing the interactivity of the rendered visualizations. In this example we add a __HoverTool__ as a demonstration of interactivity. This tool shows information about the plotted data while the cursor hovers over a point on the plot.

Most bokeh visualizations start from a __ColumnDataSource__, which provides the data to the glyphs of the plot. Then we initialize the figure object, where we define the size, title and other general parameters of the plot.

Next we create a scatter plot using the __circle__ method, which configures and adds __Scatter__ glyphs. We reference our data through the source parameter of this method. The __x__ and __y__ parameters define the centers of the circles on the plot, based on the data received from the data source. The __line__ glyph visualizes the fitted line, which is also provided by the data source.

Finally we add a __HoverTool__ using the add_tools method. The structure of this tool is quite complex. By default, the hover tool displays informational tooltips whenever the cursor is directly over a glyph. The data to show comes from the glyph’s data source, and what to display is configurable with the tooltips property, which maps display names to columns in the data source or to special known variables. Here we also use __CustomJSHover__, which allows us to create a custom formatter to apply to a hover tool field. In this custom formatter we retrieve the data-space x and y coordinates of the hovered glyph.
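
A compact sketch of the Bokeh workflow described in this section is given below. It makes a few assumptions: the data and column names are illustrative, the points are drawn with `scatter(..., marker="circle")` rather than `circle`, and the custom formatter simply rounds the hovered value to three decimals instead of reproducing the formatter of the original code.

```python
import numpy as np
from bokeh.models import ColumnDataSource, CustomJSHover, HoverTool
from bokeh.plotting import figure

# Illustrative noisy quadratic data and its least-squares fit.
x = np.linspace(-1, 1, 20)
rng = np.random.default_rng(0)
y = 3.0 * x**2 - x + 0.5 + rng.normal(0.0, 0.1, size=x.shape)
y_fitted = np.polyval(np.polyfit(x, y, 2), x)

# The data source feeds every glyph of the plot.
source = ColumnDataSource(data={"x": x, "y": y, "y_fitted": y_fitted})

# General parameters of the plot: size and title.
p = figure(width=500, height=400, title="Quadratic fit")

# Scatter glyphs for the data points and a line glyph for the fitted line,
# both reading their columns from the same source.
points = p.scatter(x="x", y="y", size=8, marker="circle", source=source)
p.line(x="x", y="y_fitted", line_color="red", line_width=2, source=source)

# Custom JavaScript formatter: `value` is the field value being formatted.
three_decimals = CustomJSHover(code="return value.toFixed(3)")

# Tooltips map display names to columns; "{custom}" routes a field
# through the formatter registered for it in `formatters`.
hover = HoverTool(
    renderers=[points],  # show tooltips only for the data points
    tooltips=[("x", "@x{custom}"), ("y", "@y{custom}")],
    formatters={"@x": three_decimals, "@y": three_decimals},
)
p.add_tools(hover)
```

Passing `p` to `bokeh.plotting.show` renders the plot; together with `bokeh.plotting.output_file` it is written to a standalone html file that keeps the hover interactivity.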

ADD FITTED LINE TO SCATTERPLOT USING PLOTLY

Another visualization package that provides interactive plots is __Plotly__. For our example we can achieve the same functionality with fewer lines of code than with __Bokeh__. As the main tool we use the scatter function from the express module of __Plotly__. Every __Plotly__ Express function uses graph objects internally and returns a __plotly.graph_objects.Figure__ instance.
In our use case we are interested in a high-level feature of this module called __trendline__. This feature provides built-in approximators, so we can fit a line to our data without any side computations. We just pass the alias of the regression method we want to apply. For the linear data we used an OLS-based trendline. In the quadratic and polynomial cases we used the built-in __LOWESS__ approximator. For a detailed description of these methods, refer to the official documentation.
To arrange the plots in one row we use the __make_subplots__ function. Please note that the __px.scatter__ function returns a __plotly.graph_objects.Figure__, which encapsulates all the objects displayed on the plot. To add its content to the subplot figure, we need to pass its __Scatter__ traces one by one.
It is also possible to implement the same visualization with a single call of the __px.scatter__ function. In this case you need to melt the dataframe into the shape [['X','variable','values']]. 'X' contains the same values as in the original dataset, the only difference being that the 'X' values are repeated for each category of y data [[y_linear, y_quadratic, y_poly]]. The 'variable' column holds the categories [[y_linear, y_quadratic, y_poly]] and the 'values' column holds their corresponding values. Additionally, the parameter facet_col='variable' should be passed to the scatter function and trendline_scope should be set to 'trace' (currently the default). The drawback of this method is that only one trendline function can be selected, and it is applied to all the categories of our data.

ADD FITTED LINE TO SCATTERPLOT USING SEABORN

Seaborn is a visualization package built on top of __matplotlib__ and closely integrated with __pandas__. Like matplotlib, it does not provide prebuilt interactive tools that can be exported to __html__ format. In contrast to matplotlib, seaborn is mostly focused on statistical graphics and provides built-in statistical computation. In our example we use the __regplot__ function, which allows us to fit a line to our data by passing only 3 parameters (__x__, __y__ and the order of the polynomial).
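
A minimal sketch of the __regplot__ usage could look like this. The data coefficients and the subplot layout are assumptions of the sketch, and the `ax` argument is added only to place the three plots side by side.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Illustrative noisy data, keyed by polynomial order (assumed coefficients).
x = np.linspace(-1, 1, 20)
rng = np.random.default_rng(0)
datasets = {
    1: 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape),
    2: 3.0 * x**2 - x + 0.5 + rng.normal(0.0, 0.1, size=x.shape),
    5: x**5 - 2.0 * x**3 + x + rng.normal(0.0, 0.1, size=x.shape),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (order, y) in zip(axes, datasets.items()):
    # regplot draws the scatterplot, fits a polynomial of the given order
    # and adds the fitted line with a confidence band.
    sns.regplot(x=x, y=y, order=order, ax=ax)
    ax.set_title(f"order = {order}")
```

Since seaborn draws on matplotlib axes, the result can be shown with `plt.show()` or saved with `fig.savefig(...)`.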

THANK YOU FOR YOUR ATTENTION!